{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Interacting with HopsFS\n",
    "\n",
    "HopsFS is a fork of the Hadoop Distributed File System (HDFS). \n",
    "\n",
    "To see what distinguishes HopsFS from HDFS from an architecural point of view refer to:\n",
    "\n",
    "- [blogpost](https://www.logicalclocks.com/introducing-hops-hadoop/)\n",
    "- [papers](https://www.logicalclocks.com/research-papers/)\n",
    "\n",
    "To interact with HopsFS from python, you can use the hdfs module in the hops-util-py library, it provides an easy-to-use API that resembles interaction with the local filesystem using the python `os` module. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import the Module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hops import hdfs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Project Information\n",
    "\n",
    "When interacting with HopsFS from Hopsworks, you are always inside a **project**. When you are inside a project your activated HDFS user will be projectname__username. This is to set project-specific access control and multi-tenancy (you can read more about the low-level details here: [hopsworks blogpost](https://www.logicalclocks.com/introducing-hopsworks/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project_user = hdfs.project_user()\n",
    "project_name = hdfs.project_name()\n",
    "project_path = hdfs.project_path()\n",
    "print(\"project user: {}\\nproject name: {}\\nproject path: {}\".format(project_user, project_name, project_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read/Write From/To HDFS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logs_README = hdfs.load(\"Logs/README.md\")\n",
    "print(\"logs README: {}\".format(logs_README.decode(\"utf-8\")))\n",
    "hdfs.dump(\"test\", \"Logs/README_dump_test.md\")\n",
    "logs_README_dumped = hdfs.load(\"Logs/README_dump_test.md\")\n",
    "print(\"logs README dumped: {}\".format(logs_README_dumped.decode(\"utf-8\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Copy Local FS <--> HDFS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# creates file in current working directory with a string\n",
    "with open('test.txt', 'w') as f:\n",
    "    f.write(\"test\")\n",
    "hdfs.copy_to_hdfs(\"test.txt\", \"Resources\", overwrite=True)\n",
    "hdfs.copy_to_local(\"Resources/test.txt\", overwrite=True)\n",
    "hdfs_copied_file = hdfs.load(\"Resources/test.txt\")\n",
    "with open('test.txt', 'r') as f:\n",
    "    local_copied_file = f.read()\n",
    "print(\"copied file from local to hdfs: {}\".format(hdfs_copied_file.decode(\"utf-8\")))\n",
    "print(\"copied file from hdfs to local: {}\".format(local_copied_file))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## List Directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logs_files = hdfs.ls(\"Logs/\")\n",
    "print(logs_files)\n",
    "logs_files_md = hdfs.glob(\"Logs/*.md\")\n",
    "print(logs_files_md)\n",
    "logs_path_names = hdfs.lsl(\"Logs/\")\n",
    "print(logs_path_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Copy Within HDFS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdfs.cp(\"Resources/test.txt\", \"Logs/\")\n",
    "logs_files = hdfs.ls(\"Logs/\")\n",
    "print(logs_files)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create and Remove Directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdfs.mkdir(\"Logs/test_dir\")\n",
    "logs_files_prior_delete = hdfs.ls(\"Logs/\")\n",
    "print(\"files prior to delete: {}\".format(logs_files_prior_delete))\n",
    "hdfs.rmr(\"Logs/test_dir\")\n",
    "logs_files_after_delete = hdfs.ls(\"Logs/\")\n",
    "print(\"files after to delete: {}\".format(logs_files_after_delete))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Move/Rename Files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logs_files_prior_move = hdfs.ls(\"Logs/\")\n",
    "print(\"files prior to move: {}\".format(logs_files_prior_move))\n",
    "hdfs.move(\"Logs/README_dump_test.md\", \"Logs/README_dump_test2.md\")\n",
    "logs_files_after_move = hdfs.ls(\"Logs/\")\n",
    "print(\"files after move: {}\".format(logs_files_after_move))\n",
    "logs_files_prior_rename = hdfs.ls(\"Logs/\")\n",
    "print(\"files prior to rename: {}\".format(logs_files_prior_rename))\n",
    "hdfs.rename(\"Logs/README_dump_test2.md\", \"Logs/README_dump_test.md\")\n",
    "logs_files_after_rename = hdfs.ls(\"Logs/\")\n",
    "print(\"files after move: {}\".format(logs_files_after_rename))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Change Owner and Change Mode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import stat\n",
    "file_stat = hdfs.stat(\"Logs/README.md\")\n",
    "print(\"file permissions prior to chmod: {0:b}\".format(file_stat.st_mode))\n",
    "hdfs.chmod(\"Logs/README.md\", 700)\n",
    "file_stat = hdfs.stat(\"Logs/README.md\")\n",
    "print(\"file permissions after to chmod: {0:b}\".format(file_stat.st_mode))\n",
    "hdfs.chmod(\"Logs/README.md\", 777)\n",
    "file_owner = file_stat.st_uid\n",
    "#print(\"file owner prior to chown: {}\".format(file_owner))\n",
    "#hdfs.chown(\"Logs/README.md\", \"meb10000\", \"meb10000\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## File Metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_stat = hdfs.stat(\"Logs/README.md\")\n",
    "print(\"file_stat: {}\".format(file_stat))\n",
    "file_access = hdfs.access(\"Logs/README.md\", 777)\n",
    "print(\"file access: {}\".format(file_access))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get Absolute Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs_path = hdfs.abs_path(\"Logs/\")\n",
    "print(abs_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check if file/folder exists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdfs.exists(\"Logs/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hdfs.exists(\"Not_Existing/neither_am_i\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark",
   "language": "",
   "name": "pysparkkernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    "version": 2
   },
   "mimetype": "text/x-python",
   "name": "pyspark",
   "pygments_lexer": "python2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
